Label propagation in a distributed system

ABSTRACT

Data are maintained in a distributed computing system that describe a graph. The graph represents relationships among items. The graph has a plurality of vertices that represent the items and a plurality of edges connecting the plurality of vertices. At least one vertex of the plurality of vertices includes a set of label values indicating the at least one vertex's strength of association with a label from a set of labels. The set of labels describe possible characteristics of an item represented by the at least one vertex. At least one edge of the plurality of edges includes a set of label weights for influencing label values that traverse the at least one edge. A label propagation algorithm is executed for a plurality of the vertices in the graph in parallel for a series of synchronized iterations to propagate labels through the graph.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of and claims the benefit of priority to U.S. patent application Ser. No. 13/452,275, filed Apr. 20, 2012 and entitled “Label Propagation in a Distributed System,” which claims the benefit of U.S. Provisional Application No. 61/477,559, filed Apr. 20, 2011, and U.S. Provisional Application No. 61/483,183, filed May 6, 2011, the contents of which are hereby incorporated by reference.

BACKGROUND

Technical Field

This disclosure pertains in general to distributed computing and in particular to using a distributed computing system to propagate labels in a graph.

Background Information

In graph processing, a computing problem is represented by a graph having a set of vertices connected by a set of edges. The graph can be used, for example, to model a real-world condition, and then the graph processing can act on the graph to analyze the modeled condition. For example, the World Wide Web can be represented as a graph where web pages are vertices and links among the pages are edges. In this example, graph processing can analyze the graph to provide information to a search engine process that ranks search results. Similarly, a social network can be represented as a graph, and graph processing can analyze the graph to learn about the relationships in the social network. Graphs can also be used to model transportation routes, paths of disease outbreaks, citation relationships among published works, and similarities among different documents. Additionally, graphs can be used for machine learning techniques that observe patterns in data in order to adjust future behaviors, such as for spam detection.

Modeling real-world conditions such as those mentioned above involves representing a great deal of information within the graph, as well as updating the graph as processing is performed or new information is received. For graphs modeling complex conditions, representing and updating the information requires significant computing resources.

SUMMARY OF THE DISCLOSURE

The above and other needs are met by a method, a non-transitory computer-readable storage medium and a system for a label propagation algorithm. Embodiments of the method comprise maintaining data in a distributed computing system. The data describe a graph that represents relationships among items. The graph includes a plurality of vertices that represent the items and a plurality of edges that connect the plurality of vertices and represent the relationships among the items. At least one vertex of the plurality of vertices includes a set of label values that indicate the at least one vertex's strength of association with a label from a set of labels. The set of labels describe possible characteristics of an item represented by the at least one vertex. At least one edge of the plurality of edges includes a label weight for influencing label values that traverse the at least one edge. The method includes executing a label propagation algorithm for the plurality of vertices in the graph in parallel for a series of synchronized iterations to propagate labels through the graph. The operations of the label propagation algorithm for a respective vertex include receiving a message that includes a weighted label value. The operations of the label propagation algorithm for the respective vertex include updating a label value for a first label in the set of labels of the vertex based on the weighted label value included in the received message to produce an updated label value. The operations of the label propagation algorithm for the respective vertex include sending a message to a target vertex connected to the vertex by an edge, where the message includes a new updated label value. The new updated label value is the updated label value weighted by the label weight of the edge connecting the respective vertex to the target vertex. The method includes assigning labels from the set of labels to the plurality of vertices based on label values associated with the plurality of vertices and outputting the labels of the plurality of vertices.

Embodiments of the non-transitory computer-readable storage medium store executable computer program instructions for performing the steps described above. Embodiments of the system comprise a processor and a non-transitory computer-readable storage medium storing processor-executable computer program instructions. The computer program instructions include instructions for performing the steps described above.

The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a high-level block diagram of a computing environment, according to one embodiment.

FIG. 2 is a high-level block diagram illustrating an example of a computer, according to one embodiment.

FIG. 3 is a high-level block diagram illustrating modules within a worker system, according to one embodiment.

FIG. 4 is a flow diagram that illustrates a process for performing a label propagation algorithm on a directed graph, according to one embodiment.

FIGS. 5A-5C illustrate an example of an iteration of the label propagation algorithm on a directed graph in a distributed system, according to one embodiment.

The figures depict embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the disclosure described herein.

DETAILED DESCRIPTION

FIG. 1 is a high-level block diagram of a distributed computing environment 100 for propagating labels in a graph. FIG. 1 illustrates a client 102, a master system 105, a distributed storage system 103, a cluster management system 107, and worker systems 106 connected by a network 104. Collectively, the distributed computing environment 100 may be used to define a graph modeling real-world conditions as a set of relationships among a tangible set of items, such as, for example and without limitation, documents and links on the Internet, a computer network topology, transportation routes in a geographic map, likelihoods that emails are spam, or a social graph. In addition, the computing environment 100 may be used to analyze the modeled conditions in order to solve one or more real-world problems associated with the conditions.

The definition and analysis of the real-world problem involves assigning labels to vertices in the graph and propagating values describing the labels through the graph. Once the analysis is performed, the labels of the vertices in the graph represent the solution to the modeled real-world problem. For example, if part of the analysis involves clustering together real-world items represented by the vertices of the graph that have similar characteristics, a cluster may be represented by the vertices that have the same label after the analysis is performed.

In an embodiment described herein, the distributed computing environment 100 applies a label propagation algorithm to the graph. The label propagation algorithm uses parallelism provided by the distributed computing environment 100 to propagate labels through the graph. The distributed computing system and the algorithm thus enable identification of solutions to the real-world problems associated with the conditions modeled by the graph in a more efficient manner than would otherwise be possible.

At a high level, the client 102 is used to provide the location of graph data describing the graph and to specify one or more algorithms to be performed on the graph data. Assume for purposes of this discussion that the algorithms include a label propagation algorithm. In an embodiment, the graph is represented as a set of vertices connected by a set of directed edges. In another embodiment, the edges are undirected. The graph data describing the graph may be stored on the distributed storage system 103. The master system 105 assigns partitions of the graph data to the worker systems 106. In some embodiments, the worker systems 106 may retrieve and store copies of their assigned graph partitions. The worker systems 106 execute the algorithm to propagate labels throughout the partitions of the graph within their respective ambits.

In more detail, the client 102 specifies a graph in which each vertex is uniquely identified by a string vertex identifier. For example, the client 102 may provide information identifying the location of the graph on the distributed storage system 103 that is connected to the network 104. The graph may be a directed graph or an undirected graph. In addition, the client 102 specifies a set of labels that may be applied to the vertices. In one embodiment, the client 102 assigns labels to some vertices while leaving the remaining, unlabeled vertices to be labeled by the label propagation algorithm. In one embodiment, the client 102 specifies a set of label values for some of the vertices. The label values for a given vertex measure that vertex's associations with the labels in the set.

The directed edges are associated with their source vertices, and each edge has a label weight and a target vertex identifier. The label weight indicates how much the source vertex's labels affect the target vertex's labels. In some embodiments, an edge has a plurality of label weights, where each label weight corresponds to a respective label. Vertices communicate directly with one another by sending messages along the directed edges. A message may instruct the target vertex to update its labels based on the values of the source vertex's labels and the label weight of the edge connecting the vertices.

An exemplary computation includes initialization of a graph and execution of the algorithm of the user program on multiple systems. The algorithm performs a sequence of supersteps or iterations separated by global synchronization points until the algorithm terminates and produces an output. A superstep is an iteration of the computation that includes ordered stages of computation for each vertex in the graph. Within each superstep, the vertices compute in parallel, each executing a function defined in the user program that expresses the logic of an algorithm. A vertex can modify its state or that of its outgoing edges, receive messages sent to it in the previous superstep, send messages to other vertices (to be received in the next superstep), or even mutate the topology of the graph.

The algorithm terminates when every vertex votes to halt. In superstep 0 (the initial superstep), every vertex is in the active state; all active vertices participate in the computation of any given superstep. A vertex deactivates itself by voting to halt. Halting means that the vertex has no further work to do unless triggered externally, and that vertex will not execute in subsequent supersteps unless it receives a message. If reactivated by a message, a vertex must explicitly deactivate itself again. The algorithm as a whole terminates when all vertices are simultaneously inactive and there are no messages in transit.
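
By way of illustration only, the superstep and vote-to-halt behavior described above may be sketched in a single process as follows. The names run_supersteps, propagate_max, and EDGES are hypothetical and do not appear in the disclosure, and the per-vertex function shown simply propagates a maximum value rather than labels. A distributed embodiment would execute the inner loop in parallel across the worker systems 106.

```python
from collections import defaultdict

def run_supersteps(values, compute, max_supersteps=100):
    """Run synchronized supersteps until all vertices are inactive and no
    messages are in transit. compute(vid, value, messages, superstep) is a
    hypothetical user function returning (new_value, outgoing, halt)."""
    active = set(values)            # superstep 0: every vertex is active
    inbox = defaultdict(list)       # messages sent in the previous superstep
    superstep = 0
    while (active or inbox) and superstep < max_supersteps:
        next_inbox = defaultdict(list)
        # A halted vertex runs again only if it has received a message.
        for vid in active | set(inbox):
            new_value, outgoing, halt = compute(
                vid, values[vid], inbox.get(vid, []), superstep)
            values[vid] = new_value
            for target, message in outgoing:
                next_inbox[target].append(message)
            if halt:
                active.discard(vid)
            else:
                active.add(vid)
        inbox = next_inbox          # global synchronization point
        superstep += 1
    return superstep

# Illustrative graph: outgoing edges keyed by source vertex.
EDGES = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

def propagate_max(vid, value, messages, superstep):
    new_value = max([value, *messages])
    changed = superstep == 0 or new_value > value
    outgoing = [(t, new_value) for t in EDGES[vid]] if changed else []
    return new_value, outgoing, True   # always vote to halt

values = {"a": 3, "b": 6, "c": 2}
steps = run_supersteps(values, propagate_max)
```

In this sketch every vertex votes to halt in each superstep and is reactivated only by an incoming message, so the loop ends once no vertex changes its value.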

The output of the algorithm is a set of values explicitly output by the vertices. The output represents a solution to the real-world problem associated with the modeled conditions involving the set of relationships among the set of items. For example, each vertex may output its name, its associated labels, and its values for the labels. The labels may represent a solution to the problem described as a clustering of real-world entities that have one or more labels in common, a set of weighted features in a classifier or other machine-learning system, a set of web pages on the Internet having a given set of characteristics, etc.

Turning now to the specific entities illustrated in FIG. 1, the client 102 is a computing device with a processor and a memory that includes an application 110 for providing the master system 105 and/or the cluster management system 107 with a user program and the location of the graph data. The user program defines an algorithm that propagates labels through a graph described by the graph data. The application 110 sends a copy of the user program to the master system 105 and/or the cluster management system 107. The application 110 also sends graph data or a location of the graph data to the master system 105.

The distributed storage system 103 includes one or more systems that may store the graph data. The distributed storage system 103 may provide the graph data to the systems connected to network 104 (i.e., client 102, master system 105, cluster management system 107, and worker systems 106). In some embodiments, the graph data is stored as a plurality of graph partitions, where a graph partition stores data describing a subset of the edges and vertices of a directed graph. In one embodiment, the distributed storage system 103 stores a file for each graph partition. The distributed storage system 103 stores the solution to the label propagation algorithm which is output by the vertices of the graph. In some embodiments, the distributed storage system 103 stores a file for each graph partition containing the output from the vertices of the partition.

The cluster management system 107 is a computing device with a processor and a memory. In some embodiments, the cluster management system 107 receives a copy of a user program from the client 102 and sends a copy of the user program to the worker systems 106. In some embodiments, the cluster management system 107 coordinates the parallel execution of the user program on the worker systems 106 and reports the results of the execution to the client 102.

The master system 105 is likewise a computing device with a processor and a memory. In some embodiments, the master system 105 receives information identifying the graph data on the distributed storage system 103 and assigns partitions of the graph data to the worker systems 106. More specifically, the master system 105 sends each worker system 106 information that uniquely describes its assigned graph partition and information enabling the worker system 106 to obtain its assigned graph partition. For example, the master system 105 sends a worker system 106 a unique file name corresponding to its assigned graph partition and the location of the file on the distributed storage system 103. A worker system 106 may be assigned one or more graph partitions.

The coordination module 114 maintains a list of worker systems 106 that participate in a computation. The worker systems 106 send registration messages to the master system 105 and the coordination module 114 registers the worker systems 106 by assigning unique identifiers to the worker systems 106. The coordination module 114 maintains a list of the registered worker systems 106 which includes the identifiers of the registered worker systems 106 and the addressing information of the registered worker systems 106. For a respective registered worker system 106, the list includes information identifying one or more assigned graph partitions. In some embodiments, the coordination module 114 sends each worker system 106 the list of the registered worker systems 106.

In some embodiments, the coordination module 114 assigns one or more partitions to each worker system 106, and sends each worker system 106 information identifying its assigned one or more partitions. A partition of a graph includes a subset of the vertices and edges of the graph. In some embodiments, the coordination module 114 determines the number of graph partitions. The number of partitions may be specified in the user program or determined by a partition function stored in the coordination module 114. For example, the default partitioning function may be a hash of a vertex identifier modulo N, where N is the number of partitions. The master system 105 may not be assigned any portion of the graph.
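
As a minimal sketch of the hash-modulo-N partitioning function mentioned above, the following assumes a SHA-256 digest as the hash. The disclosure does not name a particular hash function, and partition_for is a hypothetical helper name.

```python
import hashlib

def partition_for(vertex_id: str, num_partitions: int) -> int:
    """Map a string vertex identifier to a partition index in [0, N)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # A stable digest is used rather than Python's built-in hash(), which is
    # randomized per process for strings and could differ between machines.
    digest = hashlib.sha256(vertex_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the mapping depends only on the vertex identifier and N, any system holding the same function can determine which partition owns a given vertex without consulting a lookup table.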

In some embodiments, the coordination module 114 sends each worker system 106 a copy of the user program and initiates the execution of the user program on the worker systems 106. In some embodiments, the coordination module 114 signals the beginning of a superstep. The coordination module 114 maintains statistics about the progress of a computation and the state of the graph, such as the total size of the graph, the number of active vertices, the timing of recent supersteps, and the message traffic of recent supersteps.

The coordination module 114 also handles fault tolerance. Fault tolerance is achieved through checkpointing. At the beginning of a superstep, the coordination module 114 instructs the worker systems 106 to save the state of their partitions to persistent storage, including vertex values, edge values, and incoming messages. Worker failures are detected through messages that the coordination module 114 periodically sends to the worker systems 106. If the coordination module 114 does not receive a reply message from a worker system 106 after a specified interval, the coordination module 114 marks that worker system 106 as failed. If a worker system 106 does not receive a message from the coordination module 114 after a specified time interval, the worker system 106 terminates its processes. When a worker system 106 fails, the current state of the partitions assigned to the worker system 106 is lost. In order to recover from a worker system 106 failure, the coordination module 114 reassigns graph partitions to the currently available set of worker systems 106 at the beginning of a superstep. The available set of worker systems 106 reload their partition states from the most recent available checkpoint at the beginning of a superstep. The most recent available checkpoint may be several supersteps earlier than the latest superstep completed by any worker system 106 before the failure, which results in the missing supersteps being repeated. The frequency of checkpointing may be based on a mean time to failure, thereby balancing checkpointing cost against expected recovery cost.
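
The failure detection and partition reassignment described above might look roughly like the following sketch. The helper names failed_workers and reassign are hypothetical, and the round-robin redistribution of orphaned partitions is an assumption, since the disclosure does not specify how partitions are divided among the surviving worker systems.

```python
def failed_workers(last_reply, now, timeout):
    """Return workers whose most recent reply is older than the timeout.
    last_reply maps a worker identifier to the time of its last reply."""
    return sorted(w for w, t in last_reply.items() if now - t > timeout)

def reassign(partitions_by_worker, failed):
    """Move the partitions of failed workers onto the surviving workers.
    Round-robin placement is an illustrative assumption."""
    survivors = sorted(w for w in partitions_by_worker if w not in failed)
    if not survivors:
        raise RuntimeError("no worker systems available")
    assignment = {w: list(partitions_by_worker[w]) for w in survivors}
    orphaned = sorted(
        p for w in failed for p in partitions_by_worker.get(w, []))
    for i, partition in enumerate(orphaned):
        assignment[survivors[i % len(survivors)]].append(partition)
    return assignment
```

After reassignment, each surviving worker would reload the state of all of its partitions from the most recent available checkpoint before the next superstep begins.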

After the supersteps are finished, the coordination module 114 aggregates results from the worker systems 106 and sends the results to the distributed storage system 103. In some embodiments, the results include a set of values explicitly output by the vertices. These values describe, e.g., the label values for the vertices. In some embodiments, the coordination module 114 writes one result file per graph partition and stores the result files with the graph partitions in the distributed storage system 103. The coordination module 114 sends a notification to the client 102 including the location of the results. The client 102 may then implement the solution described by the labels, or provide the solution to other systems for implementation.

A worker system 106 is a computing device with a processor and a memory. The worker systems 106 and the master system 105 are similar types of systems in one embodiment. A worker system 106 includes a worker module 112 that stores one or more graph partitions. The worker module 112 may obtain the one or more graph partitions from the distributed storage system 103. In some embodiments, the worker module 112 stores information identifying one or more graph partitions. The worker module 112 also stores and executes a copy of the user program on the one or more partitions stored on the worker system 106.

The worker module 112 executes supersteps of a user program in response to receiving instructions from the master system 105 and/or cluster management system 107. During a superstep, the worker module 112 executes an algorithm for each active vertex in the one or more partitions stored on the worker module 112. A vertex that is active during a superstep may send messages to other vertices in order to obtain information about other vertices or edges, to add or remove vertices or edges, and to modify vertices or edges. During execution of a superstep, the worker module 112 may retrieve and/or modify graph data stored on the distributed storage system 103. When the superstep is finished, the worker module 112 sends a message to the master system 105 indicating the number of vertices that will be active in the next superstep. The supersteps continue as long as there are active vertices or there are messages in transit. When the supersteps are finished, the worker module 112 sends the results generated from the user program to the master system 105.

The worker module 112 may store the state of its assigned one or more partitions. This may include the state of each vertex in the one or more partitions, where the state of each vertex consists of its current value, a list of its outgoing edges (which includes the vertex name for the edge's destination and the edge's current value), a queue containing incoming messages, and a flag specifying whether the vertex is active.
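
One possible in-memory layout for the per-vertex state listed above is sketched below. The class and field names (Edge, VertexState, out_edges, inbox) are illustrative assumptions rather than names used by the disclosure.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Edge:
    target: str     # vertex name of the edge's destination
    value: float    # the edge's current value, e.g. a label weight

@dataclass
class VertexState:
    name: str
    value: dict                                     # current value, e.g. label -> label value
    out_edges: list = field(default_factory=list)   # list of Edge
    inbox: deque = field(default_factory=deque)     # queue of incoming messages
    active: bool = True                             # flag: is the vertex active?

v = VertexState("a", {"red": 1.0})
v.out_edges.append(Edge("b", 0.5))
v.inbox.append({"red": 0.25})
```

Using default_factory gives each vertex its own edge list and message queue, so state saved for one vertex is never shared with another.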

The network 104 represents the communication pathways between the client 102, the master system 105 and the worker systems 106. In one embodiment, the network 104 uses standard Internet communications technologies and/or protocols. Thus, the network 104 can include links using technologies such as Ethernet, 802.11, integrated services digital network (ISDN), asynchronous transfer mode (ATM), etc. Similarly, the networking protocols used on the network 104 can include the transmission control protocol/Internet protocol (TCP/IP), the hypertext transfer protocol (HTTP), the simple mail transfer protocol (SMTP), the file transfer protocol (FTP), etc. The data exchanged over the network 104 can be represented using technologies and/or formats including the hypertext markup language (HTML), the extensible markup language (XML), etc. In addition, all or some links can be encrypted using conventional encryption technologies such as the secure sockets layer (SSL), Secure HTTP (HTTPS) and/or virtual private networks (VPNs). In another embodiment, the entities can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above.

FIG. 2 is a high-level block diagram illustrating physical components of a computer 200 used as part of the client 102, master system 105 and/or worker system 106 from FIG. 1, according to one embodiment. Illustrated is at least one processor 202 coupled to a chipset 204. Also coupled to the chipset 204 are a memory 206, a storage device 208, a keyboard 210, a graphics adapter 212, a pointing device 214, and a network adapter 216. A display 218 is coupled to the graphics adapter 212. In one embodiment, the functionality of the chipset 204 is provided by a memory controller hub 220 and an I/O controller hub 222. In another embodiment, the memory 206 is coupled directly to the processor 202 instead of the chipset 204. In some embodiments, memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices.

The storage device 208 is any non-transitory computer-readable storage medium, such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 206 holds instructions and data used by the processor 202. The pointing device 214 may be a mouse, track ball, or other type of pointing device, and may be used in combination with the keyboard 210 to input data into the computer 200. The graphics adapter 212 displays images and other information on the display 218. The network adapter 216 couples the computer 200 to the network 104.

As is known in the art, a computer 200 can have different and/or other components than those shown in FIG. 2. In addition, the computer 200 may lack certain illustrated components. For example, in one embodiment, a computer 200 acting as a server may lack a keyboard 210, pointing device 214, graphics adapter 212, and/or display 218. Moreover, the storage device 208 can be local and/or remote from the computer 200 (such as embodied within a storage area network (SAN)).

As is known in the art, the computer 200 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic utilized to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 208, loaded into the memory 206, and executed by the processor 202.

Embodiments of the entities described herein can include other and/or different modules than the ones described here. In addition, the functionality attributed to the modules can be performed by other or different modules in other embodiments. Moreover, this description occasionally omits the term “module” for purposes of clarity and convenience.

FIG. 3 is a high-level block diagram illustrating modules within the worker module 112 of a worker system 106, according to one embodiment. In some embodiments, the modules retrieve one or more partitions of the graph stored on the distributed storage system 103, execute a copy of the user program, and modify the one or more retrieved partitions of the graph responsive to operations of the user program.

The worker module 112 includes a partition module 302 that manages the data in the partition database 304. The partition module 302 may retrieve one or more graph partitions and store the retrieved partitions in the partition database 304. In some embodiments, the partition module 302 retrieves the one or more graph partitions from the distributed storage system 103 based on information received from the client 102 and/or master system 105. In some embodiments, the partition module 302 receives information describing a partition of a graph and stores the information in the partition database 304. More specifically, the partition module 302 receives information identifying one or more graph partitions and the location of the graph partitions on the distributed storage system 103. The partition module 302 also saves the state of the partitions 306 in the partition database 304 in response to messages from the master system 105.

The partition database 304 stores information for one or more graph partitions 306 described above. The one or more graph partitions may be copies of graph partitions stored on the distributed storage system 103. In some embodiments, the partition database 304 stores information identifying the location of one or more graph partitions on the distributed storage system 103. A graph partition stores information for a subset of the vertices and edges of the directed graph.

The information for the vertices includes vertex names and vertex values. In one embodiment, the vertex values include label values which measure a vertex's association with one or more vertex labels. The labels may be, e.g., arbitrary text strings and serve to describe a possible characteristic of the real-world item represented by the associated vertex. For example, if a vertex represents an item that can be colored red, green, or blue, the vertex may have three labels, which may in turn be represented by the strings “red”, “green”, and “blue.” Alternatively, the labels may be represented by, e.g., numbers, where each number corresponds to a label.

A label value indicates the strength of a vertex's association with a label. For example, the label value may be a real number between, and including, zero and one. In this example, a label value of zero indicates that the vertex having the value has no association with the corresponding label, a label value of one indicates that the vertex has maximum association with the label, and label values between zero and one describe a degree of association between the label and the vertex. Different embodiments may ascribe different interpretations of the relationship among the vertex, label, and label value. For example, in one embodiment a label value of one may indicate that a vertex has the associated label, but a label value of less than one, or another threshold, indicates that the vertex does not have the associated label. Likewise, other embodiments may use label values that are not numbers. In one embodiment, each vertex of the graph maintains a vector that includes a position holding a floating point value for each of the possible labels.
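
For illustration, the per-vertex vector of label values and the threshold interpretation described above could be sketched as follows, reusing the red, green, and blue labels of the earlier example. The names LABELS, make_label_vector, and has_label are hypothetical.

```python
LABELS = ("red", "green", "blue")   # the set of possible labels

def make_label_vector(known=None):
    """Build a vector holding one floating point value per possible label.
    Labels with no known value default to 0.0 (no association)."""
    known = known or {}
    for label, value in known.items():
        if label not in LABELS:
            raise KeyError(f"unknown label: {label}")
        if not 0.0 <= value <= 1.0:
            raise ValueError("label values must lie in [0, 1]")
    return [float(known.get(label, 0.0)) for label in LABELS]

def has_label(vector, label, threshold=1.0):
    """Interpret a label value at or above the threshold as having the label."""
    return vector[LABELS.index(label)] >= threshold

vec = make_label_vector({"red": 1.0, "blue": 0.4})
```

With the default threshold of one, only the "red" label is treated as assigned in this example; lowering the threshold to 0.3 would also treat "blue" as assigned.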

The information for the edges includes edge destination names and edge values. An edge includes a label weight that affects how much each of a source vertex's labels affects the target vertex's corresponding label. For example, the label weight affects how much a source vertex's label value for a first label affects the target vertex's label value for the first label. In some embodiments, an edge may store a plurality of label weights, where each label weight corresponds to a different label. More specifically, each label weight affects how much a source vertex label value affects a corresponding label value of a target vertex. For example, a first weight may correspond to a first label and a second weight may correspond to a second label. As used herein, the combination of the label weight with the source vertex's label value is referred to as the “weighted label value.”

Different embodiments can represent the label weights and influence on label values in different ways. In one embodiment, a label weight implicitly or explicitly indicates a mathematical weighting function to perform on the source vertex's label value. For example, the weighting function may multiply the label weight with the label value of the source vertex to produce a weighted label value. Thus, if the label value of the source vertex is 1.0, and the label weight is 0.5, the weighted label value is 0.5. The weighting function may also be a floor or ceiling function, a threshold function, an averaging function, etc.
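
The multiplicative weighting function described above, applied either with a single label weight or with one label weight per label, can be sketched as follows. The function names are hypothetical, and the threshold variant shows only one of many possible readings of a "threshold function."

```python
def multiply_weight(label_value, label_weight):
    """Weighted label value as the product of label value and label weight."""
    return label_value * label_weight

def threshold_weight(label_value, label_weight):
    """Assumed threshold variant: pass the value through only if it
    reaches the label weight, otherwise contribute nothing."""
    return label_value if label_value >= label_weight else 0.0

def weighted_label_values(source_values, edge_weights, weighting=multiply_weight):
    """source_values maps label -> label value of the source vertex.
    edge_weights is either one weight for all labels or a dict of
    label -> label weight."""
    result = {}
    for label, value in source_values.items():
        weight = (edge_weights[label] if isinstance(edge_weights, dict)
                  else edge_weights)
        result[label] = weighting(value, weight)
    return result
```

For instance, weighted_label_values({"red": 1.0}, 0.5) reproduces the example in the text: a source label value of 1.0 and a label weight of 0.5 give a weighted label value of 0.5.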

A message module 312 sends messages from one vertex to another vertex during a superstep. A vertex may send messages to another vertex on a different worker system 106. The vertices may send messages to other vertices in order to obtain information about other vertices, to add or remove vertices or edges, and to modify values associated with vertices and edges. In one embodiment, the message module 312 stores and manages message queues for all of the vertices in a partition. In some embodiments, the message module 312 maintains a single incoming message queue for all of the vertices in the partition or all of the vertices in all partitions assigned to a worker system 106. The messages include a message value and the name of the destination vertex. The message values may include weighted label values.

In some embodiments, the message module 312 stores and manages an outgoing message queue for all of the vertices in a partition. The messages in the outgoing message queue may be transmitted once the queue reaches a threshold size. The message module 312 is also responsible for sending and responding to messages from the master system 105. As discussed above, the master system 105 periodically sends messages to the worker systems 106 to check on the status of a computation.
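
A minimal sketch of an outgoing message queue that transmits once it reaches a threshold size is given below. The class name OutgoingQueue, the send callback, and the explicit flush at the end of a superstep are assumptions made for illustration.

```python
class OutgoingQueue:
    """Buffers (destination vertex name, message value) pairs and hands
    them to a transport callback in batches."""

    def __init__(self, send, threshold=3):
        self._send = send            # callable taking a list of messages
        self._threshold = threshold  # batch size that triggers transmission
        self._pending = []

    def enqueue(self, destination, value):
        self._pending.append((destination, value))
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self):
        """Transmit whatever is buffered, e.g. at the end of a superstep."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(batch)

sent = []
queue = OutgoingQueue(sent.append, threshold=2)
queue.enqueue("b", 0.5)       # buffered, below the threshold
queue.enqueue("c", 0.25)      # threshold reached, batch transmitted
queue.enqueue("d", 1.0)
queue.flush()                 # remaining message transmitted
```

Batching in this way trades a small delivery delay within a superstep for fewer, larger transmissions between worker systems.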

The label propagation module 314 propagates vertex labels through the vertices of the graph partitions 306 stored in the partition database 304 according to a label propagation algorithm. As discussed above, the algorithm is performed as a series of iterations called supersteps. During each iteration, every active vertex simultaneously performs a set of operations that includes receiving weighted label values from connected source vertices, updating its own label values based on the received weighted label values, and sending updated label values adjusted by the weights of the edges to target vertices. Depending upon the embodiment, the label propagation module 314 may instruct a vertex to vote to halt when specified conditions are reached. For example, the algorithm might instruct a vertex to halt after a predefined number of supersteps are performed or after the label values for the vertex have not changed for a specified number of supersteps (e.g., the values have converged).
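
The superstep structure described above can be simulated on a single machine roughly as follows. This is a sketch under stated assumptions: the `compute` callback signature is hypothetical, and the rule that a halted vertex becomes active again when a message arrives for it is an assumption borrowed from common bulk-synchronous designs rather than something stated in this passage.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, float]             # weighted label values keyed by label
Outgoing = List[Tuple[str, Message]]   # (destination vertex name, message)
# Hypothetical callback: (vertex, superstep, inbox) -> (outgoing, vote_to_halt)
Compute = Callable[[str, int, List[Message]], Tuple[Outgoing, bool]]


def run_supersteps(vertices: List[str], compute: Compute, max_supersteps: int) -> int:
    """Run synchronized iterations and return how many supersteps were executed."""
    inbox: Dict[str, List[Message]] = {v: [] for v in vertices}
    halted = {v: False for v in vertices}
    for superstep in range(max_supersteps):
        next_inbox: Dict[str, List[Message]] = {v: [] for v in vertices}
        for v in vertices:
            # Assumption: a halted vertex is skipped unless a message reactivates it.
            if halted[v] and not inbox[v]:
                continue
            outgoing, halted[v] = compute(v, superstep, inbox[v])
            for destination, message in outgoing:
                next_inbox[destination].append(message)
        # Barrier: messages sent in this superstep become visible in the next one.
        inbox = next_inbox
        if all(halted.values()) and not any(inbox.values()):
            return superstep + 1
    return max_supersteps
```

In a distributed embodiment the inner loop would run in parallel across worker systems, with the barrier coordinated by the master system.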

The label propagation module 314 updates the label values for the vertices of the graph partitions 306 stored in the partition database 304 based on any weighted label values received by the vertices. In one embodiment, the label update module 315 updates the value for a given label of a vertex by applying a mathematical updating function to the received weighted label value and the existing label value for the vertex. The updating function may perform the same types of operations described above with respect to the weighting function. The result of the updating function is the updated label value for that label and vertex. For example, the updating function may produce the updated label value by averaging the existing value for a label with the received weighted value for the label.
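
The averaging example above might look like the following sketch. The function name is hypothetical, and treating a label the vertex does not yet hold as having an existing value of 0.0 is an assumption of this sketch.

```python
from typing import Dict


def average_update(existing: Dict[str, float], received: Dict[str, float]) -> Dict[str, float]:
    """Average each received weighted value with the vertex's existing value.

    Labels with no received value are left unchanged. A label the vertex
    does not yet hold is assumed to start at 0.0.
    """
    updated = dict(existing)
    for label, weighted in received.items():
        updated[label] = (existing.get(label, 0.0) + weighted) / 2.0
    return updated
```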

The label propagation module 314 may save the output from the vertices to the distributed storage system 103. More specifically, the label propagation module 314 may save label values, label weights, and weighted label values to the distributed storage system 103. In some embodiments, the label propagation module 314 saves the output from the vertices to the partition database 304. The label propagation module 314 may assign labels to vertices and save the assignments to the distributed storage system 103. For example, the label propagation module 314 may save a file for a partition that includes the label assignment of the vertices in the partition.

FIG. 4 is a flow diagram that illustrates a process for performing a label propagation algorithm on a directed graph, in accordance with one embodiment. This process 400 is performed by a server system (e.g., worker system 106) having one or more processors and a non-transitory memory. The memory stores one or more programs to be executed by the one or more processors. The one or more programs include instructions for the label propagation algorithm.

In this process 400, data for a directed graph are maintained 402 in a distributed computer system. The data for the directed graph are maintained 402 in one or more graph partitions. The data describe a directed graph representing relationships among items. The directed graph has vertices and edges connecting the vertices. The vertices each include a set of labels and label values and the edges each include one or more label weights. In some embodiments, the directed graph models a real-world condition and may represent, for example, a geographic map, computer network, or social network. In some embodiments, the real-world condition is analyzed in order to solve one or more real-world problems associated with the condition. In some embodiments, the set of label values for a vertex are represented as a vector having a plurality of positions, where each position in the vector corresponds to a label in the set of possible labels and where each position in the vector stores a value indicating a label value for a corresponding label.
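
One way to picture the vector representation described above is the sketch below. The fixed label ordering `LABELS` and the helper names are hypothetical, and storing 0.0 for a label the vertex has no value for is an assumption.

```python
from typing import Dict, List, Sequence

# Hypothetical set of possible labels; position i of a vector holds the value for LABELS[i].
LABELS = ("R", "B")


def to_vector(values: Dict[str, float], labels: Sequence[str] = LABELS) -> List[float]:
    """Store each label value at the position assigned to its label (0.0 if absent)."""
    return [values.get(label, 0.0) for label in labels]


def from_vector(vector: Sequence[float], labels: Sequence[str] = LABELS) -> Dict[str, float]:
    """Recover the label-to-value mapping from a positional vector."""
    if len(vector) != len(labels):
        raise ValueError("vector length must match the number of labels")
    return dict(zip(labels, vector))
```

A positional vector avoids repeating label names in every vertex and every message, at the cost of requiring all vertices to agree on the label ordering.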

The label propagation algorithm is executed 404 in parallel for a plurality of vertices in the graph in a series of synchronized iterations. The label propagation algorithm propagates label values through the graph according to the label weights to ultimately assign labels to the vertices of the graph. In some embodiments, the label propagation algorithm is executed for each vertex in the graph in parallel. An iteration corresponds to a superstep discussed above.

In the algorithm, a vertex receives 406 one or more messages containing weighted label values during an iteration. In one embodiment, the incoming messages received by the vertex are sent from source vertices that have outgoing edges to the vertex. The weighted label values received from a source vertex may correspond to one or more labels. The weighted label values are weighted by a label weight of the edge connecting the source vertex to the receiving vertex. In some embodiments, the incoming messages were sent in a previous iteration and received at the beginning of the current iteration.

The vertex updates 408 at least one of its label values based in part on the weighted label values received in the messages. More specifically, the vertex uses a mathematical updating function to determine an updated label value for one or more of its existing label values based on the existing label values, a user-defined parameter, and the received weighted label values. In some embodiments, an updated label value for a label is expressed by the formula L+C1*Sum(M), where L is the current label value for the label, C1 is a user-defined parameter, and Sum(M) is the sum of the weighted label values that correspond to the label and that were received in messages from other vertices. The user-defined parameter (i.e., C1) affects how much of the updated label value comes from the existing label value versus the weighted label values. When the user-defined parameter is large, the weighted label values from the received messages have a large influence on the updated label value. When the user-defined parameter is smaller, the existing label value has a greater influence on the updated label value. The updated label value replaces the existing label value. In some embodiments, after updating its label values, a vertex normalizes its label weights so that they add up to one. More specifically, the vertex may modify the label weights of its outgoing edges so that they add up to one.
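
A minimal sketch of the formula L+C1*Sum(M) and of the optional normalization step follows. The function names are hypothetical, a label first seen in a message is assumed to have a current value of 0.0, and dividing each outgoing weight by the total is one assumed way of making the weights add up to one.

```python
from typing import Dict, Iterable, List


def update_label_values(current: Dict[str, float],
                        messages: Iterable[Dict[str, float]],
                        c1: float) -> Dict[str, float]:
    """Apply L + C1 * Sum(M) per label, where M are the received weighted values."""
    sums: Dict[str, float] = {}
    for message in messages:
        for label, weighted in message.items():
            sums[label] = sums.get(label, 0.0) + weighted
    updated = dict(current)
    for label, total in sums.items():
        # Assumption: a label not yet held by the vertex starts at 0.0.
        updated[label] = current.get(label, 0.0) + c1 * total
    return updated


def normalize_weights(weights: List[float]) -> List[float]:
    """Scale outgoing label weights so they add up to one (unchanged if they sum to zero)."""
    total = sum(weights)
    if total == 0:
        return list(weights)
    return [w / total for w in weights]
```

With C1 set to 1, a vertex holding R:0 that receives a single weighted value R:0.7 ends up with R:0.7, and a larger C1 would shift the result further toward the incoming values.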

The vertex updates 410 a neighboring vertex. More specifically, the vertex sends an outgoing message to a neighboring vertex via one or more outgoing edges. An outgoing message includes one or more weighted label values. A weighted label value is an updated label value adjusted by a label weight associated with the outgoing edge that the message traverses. The vertex uses a weighting function on the updated label value and a label weight associated with an outgoing edge to produce a new weighted label value. For example, a weighted label value may be a product of an updated label value and a label weight. When an edge includes a single label weight, the label weight is applied to each label value. In some embodiments, an edge includes multiple label weights, where each label weight corresponds to a respective label and is applied to the corresponding label value. For example, if an edge includes label weight A and label weight B, label weight A is applied to updated label value A and label weight B is applied to updated label value B.
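
The two cases described above, a single label weight applied to every label value and a separate weight per label, can be sketched as follows using multiplication as the weighting function. The type alias and function name are hypothetical, and omitting a label that has no per-label weight on the edge is an assumption of this sketch.

```python
from typing import Dict, Union

# Hypothetical encoding: one weight for all labels, or one weight per label.
LabelWeight = Union[float, Dict[str, float]]


def weighted_message(updated: Dict[str, float], weight: LabelWeight) -> Dict[str, float]:
    """Build the weighted label values carried by a message over one outgoing edge."""
    if isinstance(weight, dict):
        # Per-label weights: weight A applies to value A, weight B to value B.
        # Assumption: labels without a weight on this edge are not sent.
        return {label: value * weight[label]
                for label, value in updated.items() if label in weight}
    # Single weight: applied to each label value.
    return {label: value * weight for label, value in updated.items()}
```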

The algorithm terminates after a fixed number of iterations or when the vertex's label values do not change more than a predefined amount in the current iteration. The number of iterations and the predefined amount may be set by the user of the client device 102. For example, the user may specify the number of iterations and/or the predefined amount as input to the user program.
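
The two termination conditions might be checked per vertex along the following lines. The function and parameter names are hypothetical, and comparing each label's absolute change against the predefined amount is one assumed way of measuring "change."

```python
from typing import Dict


def should_terminate(previous: Dict[str, float],
                     current: Dict[str, float],
                     iteration: int,
                     max_iterations: int,
                     max_change: float) -> bool:
    """Stop after a fixed number of iterations or once no label value moved by more than max_change."""
    if iteration >= max_iterations:
        return True
    labels = set(previous) | set(current)
    return all(abs(current.get(label, 0.0) - previous.get(label, 0.0)) <= max_change
               for label in labels)
```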

In one embodiment, after the algorithm terminates, the vertices of the graph are assigned 412 labels based on the label values associated with the vertices. For example, a vertex having a label value above a threshold value may be assigned a corresponding label. As the vertices correspond to real-world items, the labeling of vertices indicates a labeling of real-world items that represents a solution to the real-world problem associated with the conditions modeled by the graph. The assignment of labels to vertices representing the solution is reported 414 to the client 102 for storage, display and/or other purposes.
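
The threshold-based assignment in the example above could be sketched as follows. The function name and the choice to return every label whose value exceeds the threshold (rather than, say, only the strongest one) are assumptions.

```python
from typing import Dict, List


def assign_labels(values: Dict[str, float], threshold: float) -> List[str]:
    """Assign every label whose final label value is above the threshold."""
    return [label for label, value in values.items() if value > threshold]
```

For a vertex whose final values are R:0.7 and B:0.1, a threshold of 0.5 would assign only the label “R.”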

FIGS. 5A-5C illustrate the operations of an iteration of the label propagation algorithm on a directed graph 500. FIG. 5A shows directed graph 500 before the label propagation algorithm is executed. As shown in FIG. 5A, the directed graph 500 includes vertex 502, vertex 504, vertex 506, directed edge 508, and directed edge 510. The vertices include label values for two labels, “R” and “B.” As discussed above, the label values associated with a vertex measure the vertex's association with one or more labels. For example, vertex 502 includes a label value (i.e., 1) for the label “R” which measures the vertex's association with the label “R.” The edges of the directed graph 500 include a label weight that determines how much of a vertex's label values affect another vertex's corresponding label values.

FIG. 5B illustrates the operations performed by vertex 502 during an iteration of the label propagation algorithm. In FIG. 5B, vertex 502 determines a weighted label value (i.e., R:0.7) based on a label value (i.e., R:1) associated with the vertex and the label weight (i.e., 0.7) of edge 508. Vertex 502 applies a weighting function to the label value (i.e., R:1) and the label weight (i.e., 0.7) to produce the weighted label value (i.e., R:0.7). In the example of FIG. 5B, the weighting function multiplies the label value (i.e., R:1) by the label weight (i.e., 0.7) to produce the weighted label value (i.e., R:0.7). Vertex 502 sends an outgoing message 512 to vertex 504 containing the weighted label value (i.e., R:0.7).

FIG. 5C illustrates the operations performed by vertex 504 in a subsequent iteration after vertex 504 receives message 512. In FIG. 5C, vertex 504 performs an updating function on the weighted label value received in message 512 which was sent by vertex 502. Using the formula discussed above, vertex 504 determines an updated label value for label “R” by adding its current label value for label “R” (i.e., 0) to the product of a user-defined parameter (e.g., 1) and the weighted label value (i.e., 0.7). In this example, the user-defined parameter is set to one. Vertex 504 updates its label value for label “R” to the updated label value (i.e., 0.7). In this particular example, the updated label value for label “B” would be zero since the received message does not include a label value for label “B” and vertex 504 has a label value of zero for label “B.” After updating its label value with the updated label value, vertex 504 generates and sends a message 514 to vertex 506 containing a weighted label value. As discussed above, a weighted label value is an updated label value adjusted by a corresponding label weight of an outgoing edge. Vertex 504 performs the operations described in the discussion of FIG. 5B to generate and send a message containing a weighted label value.
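
The arithmetic of FIGS. 5B and 5C can be traced with the short sketch below, which uses only the values stated in the text. The variable names are hypothetical. Message 514 is not computed because the label weight of edge 510 is not given in this description.

```python
# Values stated in the discussion of FIGS. 5A-5C.
v502_r = 1.0                          # vertex 502's label value for "R"
edge_508_weight = 0.7                 # label weight of edge 508
v504 = {"R": 0.0, "B": 0.0}           # vertex 504's label values before the update
c1 = 1.0                              # user-defined parameter, set to one in the example

# FIG. 5B: vertex 502 multiplies its label value by the edge's label weight.
message_512 = {"R": v502_r * edge_508_weight}

# FIG. 5C: vertex 504 applies L + C1 * Sum(M) to each of its labels.
v504_updated = {
    label: value + c1 * message_512.get(label, 0.0)
    for label, value in v504.items()
}

print(message_512)    # {'R': 0.7}
print(v504_updated)   # {'R': 0.7, 'B': 0.0}
```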

Some portions of the above description describe the embodiments in terms of algorithmic processes or operations. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs comprising instructions for execution by a processor or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of functional operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.

As used herein, any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

In addition, “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the disclosure. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.

Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for processing digital documents and reformatting them for display on client devices. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the present disclosure is not limited to the precise construction and components disclosed herein and that various modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope as defined in the appended claims.

What is claimed is:
 1. A computer-implemented method executed by one or more processors, the method comprising: maintaining data in a distributed computing system, the data describing a graph representing relationships among items, the graph having a plurality of vertices representing the items, and having a plurality of edges connecting the plurality of vertices representing the relationships among the items; identifying a new label value for a first vertex, the first vertex connected to a second vertex by an edge, the first vertex including a first label value indicating a strength of association between the first vertex and a particular label, the second vertex including a second label value indicating a strength of association between the second vertex and the particular label, and the edge including an edge label value indicating a degree to which the first label value affects the second label value, wherein the particular label, for which the first label value and the second label value respectively indicate a strength of association with the first vertex and the second vertex, represents a possible characteristic of the first vertex and the second vertex; in response to identifying the new label value: updating the first label value, indicating a strength of association between the first vertex and the particular label representing a possible characteristic of the first vertex and the second vertex, to the identified new label value; determining an updated label value for the second vertex based on the updated first label value for the first vertex and the edge label value, indicating a degree to which the first label value affects the second label value, for the edge connecting the first vertex to the second vertex; and updating the second label value, indicating a strength of association between the second vertex and the particular label representing a possible characteristic of the first vertex and the second vertex, to the updated label value; and outputting at least the 
second label value indicating a strength of association between the second vertex and the particular label.
 2. The method of claim 1, wherein the new label value is included in a set of label values associated with the first vertex, and the first vertex includes a set of labels including the particular label, the method further comprising: representing the set of label values for the first vertex as a vector having a plurality of positions, each position in the vector corresponding to a label in the set of labels and storing a value indicating a label value for the corresponding label.
 3. The method of claim 1, wherein updating the second label value includes applying an updating function to the edge label value and the second label value to produce the updated label value.
 4. The method of claim 1, wherein identifying the new label value for the first vertex includes receiving a message including the new label value.
 5. The method of claim 4, wherein updating the second label value to the updated label value includes sending a message to the second vertex including the updated label value.
 6. The method of claim 1, wherein determining the updated label value for the second vertex includes applying a weighting function to the edge label value and the new label value to produce the updated label value.
 7. The method of claim 1, further comprising: sending a message to a coordinating system indicating that the first vertex has completed an iteration of a series of synchronized iterations including updating the first and second label value; and receiving a signal to begin another iteration.
 8. A non-transitory, computer-readable medium storing instructions operable when executed to cause at least one processor to perform operations comprising: maintaining data in a distributed computing system, the data describing a graph representing relationships among items, having a plurality of vertices representing the items, and having a plurality of edges connecting the plurality of vertices representing the relationships among the items; identifying a new label value for a first vertex, the first vertex connected to a second vertex by an edge, the first vertex including a first label value indicating a strength of association between the first vertex and a particular label, the second vertex including a second label value indicating a strength of association between the second vertex and the particular label, and the edge including an edge label value indicating a degree to which the first label value affects the second label value, wherein the particular label, for which the first label value and the second label value respectively indicate a strength of association with the first vertex and the second vertex, represents a possible characteristic of the first vertex and the second vertex; in response to identifying the new label value: updating the first label value, indicating a strength of association between the first vertex and the particular label representing a possible characteristic of the first vertex and the second vertex, to the identified new label value; determining an updated label value for the second vertex based on the updated first label value for the first vertex and the edge label value, indicating a degree to which the first label value affects the second label value, for the edge connecting the first vertex to the second vertex; and updating the second label value, indicating a strength of association between the second vertex and the particular label representing a possible characteristic of the first vertex and the second vertex, to 
the updated label value; and outputting at least the second label value indicating a strength of association between the second vertex and the particular label.
 9. The computer-readable medium of claim 8, wherein the new label value is included in a set of label values associated with the first vertex, and the first vertex includes a set of labels including the particular label, the operations further comprising: representing the set of label values for the first vertex as a vector having a plurality of positions, each position in the vector corresponding to a label in the set of labels and storing a value indicating a label value for the corresponding label.
 10. The computer-readable medium of claim 8, wherein updating the second label value includes applying an updating function to the edge label value and the second label value to produce the updated label value.
 11. The computer-readable medium of claim 8, wherein identifying the new label value for the first vertex includes receiving a message including the new label value.
 12. The computer-readable medium of claim 11, wherein updating the second label value to the updated label value includes sending a message to the second vertex including the updated label value.
 13. The computer-readable medium of claim 8, wherein determining the updated label value for the second vertex includes applying a weighting function to the edge label value and the new label value to produce the updated label value.
 14. The computer-readable medium of claim 8, the operations further comprising: sending a message to a coordinating system indicating that the first vertex has completed an iteration of a series of synchronized iterations including updating the first and second label value; and receiving a signal to begin another iteration.
 15. A system comprising: memory for storing data; and one or more processors operable to perform operations comprising: maintaining data in a distributed computing system, the data describing a graph representing relationships among items, having a plurality of vertices representing the items, and having a plurality of edges connecting the plurality of vertices representing the relationships among the items; identifying a new label value for a first vertex, the first vertex connected to a second vertex by an edge, the first vertex including a first label value indicating a strength of association between the first vertex and a particular label, the second vertex including a second label value indicating a strength of association between the second vertex and the particular label, and the edge including an edge label value indicating a degree to which the first label value affects the second label value, wherein the particular label, for which the first label value and the second label value respectively indicate a strength of association with the first vertex and the second vertex, represents a possible characteristic of the first vertex and the second vertex; in response to identifying the new label value: updating the first label value, indicating a strength of association between the first vertex and the particular label representing a possible characteristic of the first vertex and the second vertex, to the identified new label value; determining an updated label value for the second vertex based on the updated first label value for the first vertex and the edge label value, indicating a degree to which the first label value affects the second label value, for the edge connecting the first vertex to the second vertex; and updating the second label value, indicating a strength of association between the second vertex and the particular label representing a possible characteristic of the first vertex and the second vertex, to the updated label value; and 
outputting at least the second label value indicating a strength of association between the second vertex and the particular label.
 16. The system of claim 15, wherein the new label value is included in a set of label values associated with the first vertex, and the first vertex includes a set of labels including the particular label, the operations further comprising: representing the set of label values for the first vertex as a vector having a plurality of positions, each position in the vector corresponding to a label in the set of labels and storing a value indicating a label value for the corresponding label.
 17. The system of claim 15, wherein updating the second label value includes applying an updating function to the edge label value and the second label value to produce the updated label value.
 18. The system of claim 15, wherein identifying the new label value for the first vertex includes receiving a message including the new label value.
 19. The system of claim 18, wherein updating the second label value to the updated label value includes sending a message to the second vertex including the updated label value.
 20. The system of claim 15, wherein determining the updated label value for the second vertex includes applying a weighting function to the edge label value and the new label value to produce the updated label value. 